A Geographic Information System (GIS) describes a combination of hardware and software that can interpret and analyze spatial data. Spatial data is any type of data that has a location component. This might include satellite imagery, field data collected with a GPS unit, or any other dataset that incorporates some geographic variable.
(Photo source: GIS Geography blog)
Common GIS platforms include proprietary options like ESRI’s ArcGIS mapping software, as well as open-source options like Quantum GIS (QGIS). Several other mapping programs exist that specialize in 3D analysis or in satellite and aerial imagery. The Earth Science Department offers classes that teach GIS fundamentals and skills with some of the platforms above, and there is also a GIS certificate.
GIS classes at UVU teach the fundamentals behind spatial data, along with how this unique kind of data is properly managed and analyzed for Earth science, urban studies, city and public planning, and other applications. Even when you use your cell phone to find the fastest route to campus during traffic, a GIS is running under the hood to help you out.
Examples of classes offered at UVU: Intro to GIS: GEOG 3600, Remote Sensing of the Environment: GEOG 3400, Advanced GIS Analysis: GEOG 3650, Geospatial Field Methods: GEOG
While the digital mapping interface provided by the programs listed above is extremely useful, many GIS problems are more related to data science.
You remember data science, right? Let’s read in a .csv file (“eq”) of earthquake data in Utah…
I downloaded this earthquake dataset from the Utah Geospatial Resource Center, a great place to look for local GIS data.
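As a refresher, the read-in step looks like the sketch below. Since I can’t assume you have the UGRC download on disk, this version writes a tiny sample with the same columns to a temporary file first; with the real data you would just point `read.csv()` at your downloaded file.

```r
# Self-contained sketch of the read-in step. The sample rows mimic the
# structure of the UGRC earthquake file (X, Long, Lat, Year, Mag, Depth);
# the temp file stands in for your actual downloaded .csv.
sample_rows <- data.frame(
  X = c(34, 204), Long = c(-111.400, -112.795), Lat = c(42.300, 41.658),
  Year = c(1884, 1934), Mag = c(5.577, 6.591), Depth = c(5.0, 8.5)
)
csv_path <- tempfile(fileext = ".csv")   # substitute your own file path here
write.csv(sample_rows, csv_path, row.names = FALSE)

eq <- read.csv(csv_path)                 # the actual read-in step
str(eq)                                  # always check column types first
```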
After a quick glance, we don’t see any obvious relationships between earthquake magnitude and depth over time (some of the variables in this dataset). We could plot these data differently, but it wouldn’t matter much. It would also be pointless to build a model to try to predict future earthquakes in Utah.
So…what magnitude and depth of earthquakes might we predict to understand Utah’s future based on this data? Well, it’s complicated.
While estimating earthquake probabilities is very different from a textbook probability problem, there are some simple (as well as more complicated) ways to do it. Additionally, while we know these earthquakes occurred in Utah, we don’t really know where they happened.
Luckily we can make this spatial data!
There are two main types of spatial data. The first one that most people work with is vector data. This data is stored in the form of points, lines, and polygons (basic shapes). These data types are used to represent real-world features such as earthquake epicenters (points), roads or rivers (lines), and lakes or other boundaries (polygons). Data values are assigned to these different shapes to represent real (or hypothetical) information about the world.
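Before any GIS package enters the picture, vector features are just coordinate structures: a point is one (Long, Lat) pair and a polygon is a closed ring of vertices. The sketch below uses a hypothetical square “county” (not a real boundary) to show the kind of question vector data answers, such as “does this epicenter fall inside this polygon?”, here answered with a simple even-odd ray-casting test.

```r
# Vector data in its rawest form: coordinate matrices. The square "county"
# below is hypothetical, purely for illustration.
epicenters <- matrix(c(-112.0, 40.5,            # point features (Long, Lat)
                       -110.0, 37.0), ncol = 2, byrow = TRUE)
county <- matrix(c(-113, 40, -111, 40, -111, 41, -113, 41, -113, 40),
                 ncol = 2, byrow = TRUE)        # polygon ring, first vertex repeated

# Even-odd ray casting: count how many polygon edges a ray from the point crosses
point_in_poly <- function(pt, ring) {
  n <- nrow(ring) - 1
  inside <- FALSE
  for (i in seq_len(n)) {
    x1 <- ring[i, 1]; y1 <- ring[i, 2]
    x2 <- ring[i + 1, 1]; y2 <- ring[i + 1, 2]
    crosses <- ((y1 > pt[2]) != (y2 > pt[2])) &&
      (pt[1] < (x2 - x1) * (pt[2] - y1) / (y2 - y1) + x1)
    if (crosses) inside <- !inside
  }
  inside
}

point_in_poly(epicenters[1, ], county)  # first epicenter falls inside
point_in_poly(epicenters[2, ], county)  # second falls outside
```

Real GIS libraries do exactly this kind of geometry, just faster and with proper handling of edge cases.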
Let’s look at our earthquake dataset and think about how to turn this into spatial data.
class(eq)
## [1] "data.frame"
dim(eq)
## [1] 30369 6
summary(eq)
## X Long Lat Year
## Min. : 34 Min. :-114.2 Min. :36.75 Min. :1884
## 1st Qu.: 8250 1st Qu.:-112.6 1st Qu.:38.52 1st Qu.:1987
## Median :16128 Median :-112.2 Median :40.33 Median :1997
## Mean :16145 Mean :-112.2 Mean :40.02 Mean :1996
## 3rd Qu.:24066 3rd Qu.:-111.7 3rd Qu.:41.71 3rd Qu.:2007
## Max. :31909 Max. :-108.8 Max. :42.50 Max. :2016
## Mag Depth
## Min. :0.110 Min. : 0.20
## 1st Qu.:0.980 1st Qu.: 2.50
## Median :1.380 Median : 5.80
## Mean :1.454 Mean : 6.16
## 3rd Qu.:1.820 3rd Qu.: 8.10
## Max. :6.591 Max. :90.50
head(eq)
## X Long Lat Year Mag Depth
## 1 34 -111.400 42.300 1884 5.577000 5.0
## 2 204 -112.795 41.658 1934 6.591344 8.5
## 3 207 -112.745 41.571 1934 5.890994 8.5
## 4 290 -114.250 37.350 1943 3.500000 16.0
## 5 291 -114.250 37.350 1943 4.000000 16.0
## 6 297 -114.150 37.600 1944 3.500000 16.0
There are just over 30,000 features (or locations), and I reduced the original dataset to show only the magnitude and depth recorded at these locations.
The class of the data is a data.frame, but we have latitude and longitude provided at these locations. Let’s use the location information to make this a spatial dataset.
library(sp)
eq.sp <- eq
coordinates(eq.sp)<-~Long+Lat
class(eq.sp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
You can see that we have transformed the data into a spatial points dataframe with the sp package. Let’s see what this data looks like.
plot(eq.sp)
maps::map("state","utah",col='red',add=T)
All done right? No.
Notice how stretched the data and the outline of the state look; Utah should not look that wide on a map.
This is where we need to think carefully about spatial data. At the moment there is no defined projection or coordinate reference system for our data; the points are simply being plotted on Cartesian axes using raw longitude and latitude values.
proj4string(eq.sp)
## [1] NA
We need to add the correct information to map this data correctly. We can do so with an EPSG code rather than typing out all the necessary variables. Knowing which code to use depends on the metadata or data description.
proj4string(eq.sp) <- CRS("+init=epsg:4269")
proj4string(eq.sp)
## Warning in proj4string(eq.sp): CRS object has comment, which is lost in output
## [1] "+proj=longlat +datum=NAD83 +no_defs"
The projection information from proj4string shows part of the language of GIS. These elements define the mathematical function used to digitally recreate the surface of the Earth and also describe the distortion introduced when displaying the data in 2D space. In short, this information ties our points to locations in the real world; entering incorrect spatial information would place them in the wrong location. At this point we only have a simple coordinate reference system. The warning about the lost comment is due to a change in PROJ syntax and can be avoided by switching over to the increasingly popular sf package.
Projection is also important. It defines the distortion required to convert 3D real-world data to a 2D map; an incorrect projection could severely distort area, distance, or other important spatial properties. We should re-project this data to a projection that minimizes the distortion for Utah. A widely used projection for the state is UTM Zone 12N. You can look up EPSG codes here.
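To see why raw latitude/longitude distances mislead, here is a back-of-envelope sketch in base R. At Utah’s latitude (about 40° N) one degree of longitude covers far less ground than one degree of latitude, which is exactly the east-west stretching we saw in the unprojected plot. The haversine formula below is a standard great-circle approximation on a spherical Earth (radius 6371 km); the starting coordinate is an arbitrary point near Provo.

```r
# Great-circle distance on a spherical Earth (haversine), in km
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))
}

# One degree east vs one degree north, starting near Provo
east  <- haversine_km(-111.66, 40.23, -110.66, 40.23)  # ~85 km
north <- haversine_km(-111.66, 40.23, -111.66, 41.23)  # ~111 km
c(east = east, north = north)
```

Treating a degree as a fixed unit of distance (as an unprojected plot does) therefore stretches the map east-west; an equal-area or UTM projection fixes this for a region like Utah.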
eq.proj <- spTransform(eq.sp, CRS("+init=epsg:32612"))
proj4string(eq.proj)
## Warning in proj4string(eq.proj): CRS object has comment, which is lost in output
## [1] "+proj=utm +zone=12 +datum=WGS84 +units=m +no_defs"
Notice that the projection string contains more information after we transform the data to a new projection. Some distortion is always necessary, but it should be minimized.
Let’s read in a pre-made spatial layer of the state’s counties from the same website above and transform it into the same projection.
library(rgdal);library(raster)
cnty <- readOGR("Utah_County_Boundaries", layer="Counties")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/mattolson/Desktop/spatdat/Utah_County_Boundaries", layer: "Counties"
## with 29 features
## It has 14 fields
class(cnty)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
ut <- raster::aggregate(cnty, dissolve=TRUE) # dissolve county layers to create state boundaries
ut
## class : SpatialPolygons
## features : 1
## extent : -114.0529, -109.0416, 36.99766, 42.00171 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
ut <- spTransform(ut, proj4string(eq.proj)) # transform to same projection
proj4string(eq.proj)==proj4string(ut) # always a good idea to build in self checks
## [1] TRUE
Now let’s plot this along with our earthquake data.
plot(eq.proj,pch=20,main="UT earthquakes (1850-2016)")
plot(ut,add=T,border='red',lwd=1.5)
Great! That looks much more accurate.
If you want to learn more about spatial projections and coordinate reference systems you will need to take a GIS class or look up more information online! These concepts are the basis for geodesy which includes the study of how GPS/GNSS works for navigation and geolocation of data.
We created a spatial object with the sp package in this section. sp runs in tandem with a package called rgdal, a geospatial package that connects R to the GDAL library. Working with spatial data in R requires you to install some very important external geospatial libraries first. I load one specific package below, but you probably shouldn’t attempt install.packages("...") before taking a few other necessary steps. I’ll discuss this a bit more later on.
Now that we have our datasets in the right projection for the area, we can perform some simple spatial analysis.
Let’s make this earthquake data a bit more useful by joining attributes between our datasets. We will calculate the maximum magnitude and depth for each county and count the number of earthquake occurrences over this time period.
cnty <- spTransform(cnty, proj4string(eq.proj)) # transform the county layer to the same projection
## Warning in proj4string(eq.proj): CRS object has comment, which is lost in output
cnty.join <- data.frame(over(cnty, eq.proj[,c("Mag","Depth")], fn = max),
over(cnty, eq.proj[,1], fn = function(x) length(x)) ) # calculate the max magnitude, max depth, and number of earthquakes
cnty@data <- data.frame(cnty@data, cnty.join)
There are several ways to make spatial “joins” between datasets, above is just one (somewhat messy) example. Several different tools are capable of the same task and depending on your dataset and the desired results, your process may look very different.
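Conceptually, once each point has been assigned to a polygon, a spatial join like the one above reduces to an ordinary grouped aggregation. The base-R sketch below shows that idea with toy county labels (these are invented assignments, not real point-in-polygon results): compute the max magnitude and depth per county, count the earthquakes, and merge the summaries back together.

```r
# Toy point attributes with a pre-assigned county label (hypothetical data)
pts <- data.frame(
  county = c("Utah", "Utah", "Salt Lake", "Salt Lake", "Salt Lake"),
  Mag    = c(2.1, 3.4, 1.2, 5.6, 2.0),
  Depth  = c(4.0, 8.5, 2.2, 9.1, 5.0)
)

joined <- merge(
  aggregate(cbind(Mag, Depth) ~ county, data = pts, FUN = max),  # max per county
  aggregate(Mag ~ county, data = pts, FUN = length),             # earthquake count
  by = "county", suffixes = c(".max", ".n")
)
names(joined)[4] <- "count"  # rename the count column for clarity
joined
```

The spatial version simply replaces the pre-assigned `county` column with an actual point-in-polygon overlay (what `over()` computes).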
Let’s create a few maps showing our results
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
cnty@data$id <- rownames(cnty@data)
cnty.df <- join(fortify((cnty)),cnty@data, by="id")
## Regions defined for each Polygons
ggplot(cnty.df, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=X),colour = "black") +
ggtitle("Utah earthquakes by county (1880 to 2016)") +
theme(plot.title = element_text(size = 6, face = "bold")) +
ylab("Northing (m)") + xlab("Easting (m)") + theme_classic() +
scale_fill_gradient(name="Count", low='blue',high='red') +
coord_fixed()
ggplot(cnty.df, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Mag),colour = "black") +
ggtitle("Maximum magnitude by county (1880 to 2016)") +
theme(plot.title = element_text(size = 6, face = "bold")) +
ylab("Northing (m)") + xlab("Easting (m)") + theme_classic() +
scale_fill_gradient(name="Magnitude", low='yellow',high='red') +
coord_fixed()
# gridExtra::grid.arrange(vis1, vis2, nrow=1)
Now we are able to see which counties have experienced the most earthquakes and the strongest ones.
As you can see, spatial data wrangling is a bit different in R, but the results can be great.
Breaking this information up by county is useful, but what if we want more precise spatial resolution? You may have noticed that our original map of Utah showed a very different spatial pattern than the county maps we just created. While county-level data may be useful for some studies, the environment does not care about these arbitrary boundaries.
Let me introduce another type of spatial data: raster data. This is gridded data and is very common in GIS analysis. In a gridded dataset, each grid cell receives a pixel value, so coverage is much greater. The standard package in R for raster data is raster; however, a transition is currently underway to the newer terra package.
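Stripped of the package machinery, a raster is just a matrix of values plus georeferencing: an origin, a cell size, and a CRS. The sketch below uses extent numbers that mimic the 1000 m Utah grid we build shortly (simplified, and with the CRS omitted) to show how a projected coordinate maps to a grid cell with plain arithmetic.

```r
# A raster boiled down: a matrix of values plus an origin and a cell size.
# Extent and resolution loosely mimic the 1000 m Utah grid used later.
xmin <- 228582.4
ymax <- 4653573
res  <- 1000
grid <- matrix(0, nrow = 559, ncol = 445)  # row 1 is the top (northern) edge

# Map a projected (x, y) coordinate to its (row, col) grid cell
cell_of <- function(x, y) {
  col <- floor((x - xmin) / res) + 1       # columns increase eastward
  row <- floor((ymax - y) / res) + 1       # rows increase southward
  c(row = row, col = col)
}

cell_of(xmin + 500, ymax - 500)            # a point in the top-left cell
```

Functions like `rasterize()` perform this coordinate-to-cell lookup for every point, then aggregate the values that land in each cell.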
Let’s take our earthquake data and turn the point values into raster data.
library(raster)
r <- raster(ut);res(r) <- 1000 # assign a "resolution" or pixel size
r <- rasterize(ut,r)
r
## class : RasterLayer
## dimensions : 559, 445, 248755 (nrow, ncol, ncell)
## resolution : 1000, 1000 (x, y)
## extent : 228582.4, 673582.4, 4094573, 4653573 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=12 +datum=WGS84 +units=m +no_defs
## source : memory
## names : layer
## values : 1, 1 (min, max)
plot(r)
Notice the extent and type of data.
r.counts <- rasterize(coordinates(eq.proj), r, fun='count', background=NA)
plot(r.counts)
plot(ut, add=TRUE)
We can also apply interpolation techniques, which create predictions at unsampled locations based on the values observed nearby.
library(gstat)
# nearest neighbor interpolation
gs <- gstat(formula=Depth~1, locations=eq.proj, nmax=5, set=list(idp = 0)) # idp=0 so all 5 neighbors are equally weighted
nn <- interpolate(r, gs)
## [inverse distance weighted interpolation]
nnmsk <- mask(nn, ut)
plot(nnmsk, main="Nearest neighbor interpolation of earthquake depth (km)")
# inverse distance weighting
gs2 <- gstat(formula=Depth~1, locations=eq.proj)
idw <- interpolate(r, gs2)
## [inverse distance weighted interpolation]
idwr <- mask(idw, ut)
plot(idwr, main="IDW interpolation of earthquake depth (km)")
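The gstat calls above are full implementations, but the core of inverse distance weighting fits in a few lines: the predicted value at a new location is a weighted mean of the observations, with weights 1 / distance^p (p = 2 is a common default; setting p = 0 makes all neighbors equal, which is the trick the nearest-neighbor example used). The depths below are toy numbers for illustration.

```r
# Minimal base-R sketch of inverse distance weighting (IDW)
idw_predict <- function(x0, y0, obs, p = 2) {
  d <- sqrt((obs$x - x0)^2 + (obs$y - y0)^2)       # distance to each observation
  if (any(d == 0)) return(obs$Depth[d == 0][1])    # exact hit: return that value
  w <- 1 / d^p                                     # closer points weigh more
  sum(w * obs$Depth) / sum(w)                      # weighted mean
}

# Three hypothetical stations with observed depths
obs <- data.frame(x = c(0, 10, 10), y = c(0, 0, 10), Depth = c(2, 6, 10))

idw_predict(0, 0, obs)   # on top of a station: returns its depth, 2
idw_predict(5, 0, obs)   # between stations: a distance-weighted average
```

A real implementation adds search neighborhoods (gstat’s `nmax`), handles ties and projections, and vectorizes over an entire raster, but the weighting logic is the same.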
We could continue to explore spatial analysis with cross-validation and kriging methods to estimate earthquake depth or other variables in locations where we do not have data.
This is only the tip of the spatial data iceberg. Many datasets have geospatial information, and R can be a great tool to understand relationships in a new light. R can be used for classification, prediction, observing land-use change, and running physically based models of environmental processes.
Let’s consider a simple probabilistic model to predict whether an earthquake is likely to occur in the next 10 years based on the previous 10 years. If we assume a Poisson distribution (common for count data) and independence between years, we can estimate the probability of any earthquake in a 10-year period as follows…
eq10 <- eq.proj[eq.proj@data$Year>=2006,] # subset to only include the last 10 years
eq.prob10 <- rasterize(coordinates(eq10), r, fun='count', background=0)
eq.prob10 <- ( 1 - (exp(-(eq.prob10/10) )^10) )
plot(eq.prob10, main="Simple 10-year earthquake probability (2017-2026)")
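Spelled out per cell, the arithmetic is: a cell with n earthquakes in the last decade gets an annual rate of n/10, and under a Poisson model the chance of at least one event in the next 10 years is 1 − exp(−rate × 10), which simplifies to 1 − exp(−n). A few worked numbers:

```r
# Per-cell probability of at least one earthquake in the next `years` years,
# given n events observed over the same length of record (Poisson assumption)
p_any <- function(n, years = 10) 1 - exp(-(n / years) * years)  # = 1 - exp(-n)

p_any(0)   # no recent quakes: probability 0
p_any(1)   # one quake per decade: ~0.63
p_any(5)   # five quakes per decade: ~0.99, near certainty
```

This is why even modestly active cells along the Wasatch Front saturate toward probability 1 in the map above.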
As one would expect, an earthquake of any magnitude is very likely along the Wasatch Front (a very active place for earthquakes). Check out the live and recent Utah earthquake maps to convince yourself that this model, while simple, is reasonable.
In reality, earthquake probabilities are not so straightforward and independence between years cannot be assumed. This is one type of model that an Earth scientist may be interested in. We could also subset this to look at earthquakes of a given (significant) magnitude occurring in an area, which would make our model more useful and specific.
There are several raster datasets available all over the place, whether you are interested in bioclimatic variables (like this dataset from WorldClim) for ecological modeling or other data related to your field.
The raster package allows you to easily download some useful datasets to get started. Let’s take a look at the bioclimatic variables. For variable names, see the website above.
biol <- getData('worldclim', var='bio', res=10)
biol
## class : RasterStack
## dimensions : 900, 2160, 1944000, 19 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## names : bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10, bio11, bio12, bio13, bio14, bio15, ...
## min values : -269, 9, 8, 72, -59, -547, 53, -251, -450, -97, -488, 0, 0, 0, 0, ...
## max values : 314, 211, 95, 22673, 489, 258, 725, 375, 364, 380, 289, 9916, 2088, 652, 261, ...
Notice that we now have a raster stack that contains several bioclimatic variables that we could use in a spatial analysis.
Let’s project this data to match our Utah boundary vector file so we can crop (cut) the larger dataset to Utah.
plot(biol[[1]],main="Mean annual temperature (Global)",ylab="Latitude",xlab="Longitude")
biol1_ut <- projectRaster(biol[[1]],crs=CRS(proj4string(ut))) %>% crop(ut) %>% mask(ut) # [[1]] is indexing to select the first layer (biol1)
plot(biol1_ut, main="Reprojected Mean annual temperature (Utah) - 10 minutes of a degree resolution")
plot(ut,add=T)
Notice how coarse the resolution is; this dataset can also be downloaded at 0.5 minutes of a degree.
Another great resource is remote sensing data. For example, we can also download digital surface elevation extracted from satellite-based radar (also through the raster package).
elev <- getData("alt", country='USA')
plot(elev[[1]], main="USA Elevation (meters)")
elev_ut <- projectRaster(elev[[1]],crs=CRS(proj4string(ut))) %>% crop(ut) %>% mask(ut) # crop/mask
plot(elev_ut, main="Utah elevation (meters) & 2016 earthquakes")
plot(eq.proj[eq.proj@data$Year==2016,], col='red',add=T)
Combining datasets, further spatial analysis, and modeling are great avenues for exploring GIS capabilities in R. There are also other great mapping packages, such as ggmap, that will allow you to create maps with a Google or other basemap in the background.
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
cnty0 <- readOGR("Utah_County_Boundaries", layer="Counties")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/mattolson/Desktop/spatdat/Utah_County_Boundaries", layer: "Counties"
## with 29 features
## It has 14 fields
ut_eq <- eq.sp %>% crop(cnty0[29,]) %>% as.data.frame()
qmplot(Long, Lat, data = ut_eq[,c(2:3,5)], colour = I('firebrick'), size = I(2), darken = .3)
## Using zoom = 9...
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
To learn more about GIS and satellite/aerial remote sensing datasets, sign up for a GIS class (GEOG 3600)!
This section is crucial to read.
You’ve seen some of the geospatial packages in use, and there are many others: sp, rgdal, raster, rgeos, sf, terra, and more.
One important note about these packages: they depend on external libraries such as GDAL, PROJ.4, and GEOS.
Here is the first page of a workshop that provides links to the different software. Installing each one correctly is not always trivial, particularly on macOS. Read the information online carefully.
My recommendation would be to install QGIS on your computer, both for its digital mapping abilities and because it should install all of these necessary libraries for you. Some Python distributions, like Anaconda, can also help build these libraries correctly for your OS. Just like with anything, learning spatial data science requires a lot of time and patience. Get your zen on and go explore spatial data!
Good luck!